Abstract

Several variants of the Long Short-Term Memory (LSTM) architecture for
recurrent neural networks have been proposed since its inception in 1995.
In recent years, these networks have become the state-of-the-art models
for a variety of machine learning problems. This has led to a renewed
interest in understanding the role and utility of various computational
components of typical LSTM variants. In this paper, we present the first
large-scale analysis of eight LSTM variants on three representative tasks:
speech recognition, handwriting recognition, and polyphonic music
modeling. The hyperparameters of all LSTM variants for each task were
optimized separately using random search and their importance was assessed
using the powerful fANOVA framework. In total, we summarize the results of
5400 experimental runs (≈ 15 years of CPU time), which makes our study the
largest of its kind on LSTM networks. Our results show that none of the
variants can improve upon the standard LSTM architecture significantly,
and demonstrate the forget gate and the output activation function to be
its most critical components. We further observe that the studied
hyperparameters are virtually independent and derive guidelines for their
efficient adjustment.

1. Introduction 

Recurrent neural networks with Long Short-Term Memory (which we will
concisely refer to as LSTMs) have emerged as an effective and scalable
model for several learning problems related to sequential data. Earlier
methods for attacking these problems were usually hand-designed
workarounds to deal with the sequential nature of data such as language
and audio signals. Since LSTMs are effective at capturing long-term
temporal dependencies without suffering from the optimization hurdles that
plague simple recurrent networks (SRNs) (Hochreiter, 1991; Bengio et al.,
1994), they have been used to advance the state of the art for many
difficult problems. This includes handwriting recognition (Graves et al.,
2009; Pham et al., 2013; Doetsch et al., 2014) and generation (Graves et
al., 2013), language modeling (Zaremba et al., 2014) and translation
(Luong et al., 2014), acoustic modeling of speech (Sak et al., 2014),
speech synthesis (Fan et al., 2014), protein secondary structure
prediction (Sønderby & Winther, 2014), analysis of audio (Marchi et al.,
2014), and video data (Donahue et al., 2014) among others.

The central idea behind the LSTM architecture is a memory cell which can
maintain its state over time, and non-linear gating units which regulate
the information flow into and out of the cell. Most modern studies
incorporate many improvements that have been made to the LSTM architecture
since its original formulation (Hochreiter & Schmidhuber, 1995; 1997).
However, LSTMs are now applied to many learning problems which differ
significantly in scale and nature from the problems that these
improvements were initially tested on. A systematic study of the utility
of various computational components which comprise LSTMs (see Figure 1)
was missing. This paper fills that gap and systematically addresses the
open question of improving the LSTM architecture.

We evaluate the most popular LSTM architecture (vanilla LSTM; Section 2)
and eight different variants thereof on three benchmark problems: acoustic
modeling, handwriting recognition and polyphonic music modeling. Each
variant differs from the vanilla LSTM by a single change. This allows us
to isolate the effect of each of these changes on the performance of the
architecture. Random search (Anderson, 1953; Solis & Wets, 1981; Bergstra
& Bengio, 2012) is used to find the best performing hyperparameters for
each variant on each problem, enabling a reliable comparison of the
performance of different variants. We also provide insights gained about
hyperparameters and their interaction using fANOVA (Hutter et al., 2014).

Figure 1. Detailed schematic of the Simple Recurrent Network (SRN) unit
(left) and a Long Short-Term Memory block (right) as used in the hidden
layers of a recurrent neural network.

2. Vanilla LSTM 

The LSTM architecture most commonly used in the literature was originally
described by Graves & Schmidhuber (2005).¹ We refer to it as vanilla LSTM
and use it as a reference for comparison of all the variants. The vanilla
LSTM incorporates changes by Gers et al. (1999) and Gers & Schmidhuber
(2000) into the original LSTM (Hochreiter & Schmidhuber, 1997) and uses
full gradient training. Section 3 provides descriptions of these major
LSTM changes.

A schematic of the vanilla LSTM block can be seen in Figure 1. It features
three gates (input, forget and output), block input, a single cell (the
Constant Error Carousel), an output activation function, and peephole
connections. The output of the block is recurrently connected back to the
block input and all of the gates.

¹ But note that some studies omit peephole connections.

The vector formulas for a vanilla LSTM layer forward pass are given below.
The corresponding Back-Propagation Through Time (BPTT) formulas can be
found in the supplementary material. Here $x^t$ is the input vector at
time $t$, the $W$ are rectangular input weight matrices, the $R$ are
square recurrent weight matrices, the $p$ are peephole weight vectors and
the $b$ are bias vectors. The functions $\sigma$, $g$ and $h$ are
point-wise non-linear activation functions: the logistic sigmoid
($\sigma(x) = \frac{1}{1+e^{-x}}$) is used as the activation function of
the gates, and the hyperbolic tangent is usually used as the block input
and output activation function. The point-wise multiplication of two
vectors is denoted with $\odot$:


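$$
\begin{aligned}
z^t &= g(W_z x^t + R_z y^{t-1} + b_z) && \text{block input} \\
i^t &= \sigma(W_i x^t + R_i y^{t-1} + p_i \odot c^{t-1} + b_i) && \text{input gate} \\
f^t &= \sigma(W_f x^t + R_f y^{t-1} + p_f \odot c^{t-1} + b_f) && \text{forget gate} \\
c^t &= i^t \odot z^t + f^t \odot c^{t-1} && \text{cell state} \\
o^t &= \sigma(W_o x^t + R_o y^{t-1} + p_o \odot c^t + b_o) && \text{output gate} \\
y^t &= o^t \odot h(c^t) && \text{block output}
\end{aligned}
$$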

3. History of LSTM 

3.1. Original Formulation 

The initial version of the LSTM block (Hochreiter & Schmidhuber, 1995;
1997) included (possibly multiple) cells, input and output gates, but no
forget gate and no peephole connections. The output gate, unit biases, or
input activation function were omitted for certain experiments. Training
was done using a mixture of Real Time Recurrent Learning (RTRL) and
Backpropagation Through Time (BPTT). Only the gradient of the cell was
propagated back through time, and the gradient for the other recurrent
connections was truncated. Thus, that study did not use the exact gradient
for training. Another feature of that version was the use of full gate
recurrence, which means that all the gates received recurrent inputs from
all gates at the previous time-step in addition to the recurrent inputs
from the block outputs. This feature did not appear in any of the later
papers.

3.2. Forget Gate 

The first paper to suggest a modification of the LSTM architecture
introduced the forget gate (Gers et al., 1999), enabling the LSTM to reset
its own state. This allowed learning of continual tasks such as embedded
Reber grammar.

3.3. Peephole Connections 

Gers & Schmidhuber (2000) argued that in order to learn precise timings,
the cell needs to control the gates. So far, this was only possible
through an open output gate. Peephole connections (connections from the
cell to the gates, blue in Figure 1) were added to the architecture in
order to make precise timings easier to learn. Additionally, the output
activation function was omitted, as there was no evidence that it was
essential for solving the problems that LSTM had been tested on so far.

3.4. Full Gradient 

The final modification towards the vanilla LSTM was done by Graves &
Schmidhuber (2005). This study presented the full backpropagation through
time (BPTT) training for LSTM networks with the architecture described in
Section 2, and presented results on the TIMIT benchmark. Using full BPTT
had the added advantage that LSTM gradients could be checked using finite
differences, making practical implementations more reliable.
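
The finite-difference check mentioned above can be sketched generically. The helper names `numerical_gradient` and `max_relative_error` are hypothetical, and the toy quadratic loss merely stands in for an LSTM loss whose analytic gradient would come from BPTT.

```python
def numerical_gradient(loss, params, eps=1e-5):
    """Central-difference estimate of d loss / d params for a flat list."""
    grad = []
    for k in range(len(params)):
        original = params[k]
        params[k] = original + eps
        loss_plus = loss(params)
        params[k] = original - eps
        loss_minus = loss(params)
        params[k] = original  # restore the parameter before moving on
        grad.append((loss_plus - loss_minus) / (2.0 * eps))
    return grad


def max_relative_error(analytic, numeric, floor=1e-12):
    """Largest element-wise relative deviation between two gradients."""
    return max(abs(a - n) / max(abs(a), abs(n), floor)
               for a, n in zip(analytic, numeric))


# Toy usage: loss(w) = sum(w_k^2) has the exact gradient 2 * w.
weights = [0.3, -1.2, 2.0]
analytic = [2.0 * w for w in weights]
numeric = numerical_gradient(lambda w: sum(v * v for v in w), weights)
```

A small relative error between `analytic` and `numeric` indicates that the backward pass agrees with the loss; such a comparison is only meaningful when the analytic gradient is exact, which is what full BPTT provides.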

3.5. Other Variants 

Since its introduction the vanilla LSTM has been the most commonly used
architecture, but other variants have been suggested too. Before the
introduction of full BPTT training, Gers et al. (2002) utilized a training
method based on Extended Kalman Filtering which enabled the LSTM to be
trained on some pathological cases at the cost of high computational
complexity. Schmidhuber et al. (2007) proposed using a hybrid
evolution-based method instead of BPTT for training but retained the
vanilla LSTM architecture.

Bayer et al. (2009) evolved different LSTM block architectures that
maximize fitness on context-sensitive grammars. Sak et al. (2014)
introduced a linear projection layer that projects the output of the LSTM
layer down before recurrent and forward connections in order to reduce the
number of parameters for LSTM networks with many blocks. By introducing a
trainable scaling parameter for the slope of the gate activation
functions, Doetsch et al. (2014) were able to improve the performance of
LSTM on an offline handwriting recognition dataset. In what they call
Dynamic Cortex Memory, Otte et al. (2014) improved convergence speed of
LSTM by adding recurrent connections between the gates of a single block
(but not between the blocks).

Cho et al. (2014) proposed a simplification of the LSTM architecture
called Gated Recurrent Unit (GRU). They used neither peephole connections
nor output activation functions, and coupled the input and the forget gate
into an update gate. Finally, their output gate (called reset gate) only
gates the recurrent connections to the block input ($W_z$). Chung et al.
(2014) performed an initial comparison between GRU and LSTM and reported
mixed results.

4. Evaluation Setup 

The focus of our study is to compare different LSTM variants, and not to
achieve state-of-the-art results. Therefore, our experiments are designed
to keep the setup simple and the comparisons fair. The vanilla LSTM is
used as a baseline and evaluated together with eight of its variants. Each
variant adds to, removes from, or modifies the baseline in exactly one
aspect, which allows us to isolate its effect. Three different datasets
from different domains are used to account for cross-domain variations.

Since the hyperparameter space is large and impossible to traverse
completely, random search was used in order to obtain the best-performing
hyperparameters (Bergstra & Bengio, 2012) for every combination of variant
and dataset. Thereafter, all analyses focused on the 10% best performing
trials for each variant and dataset (Section 5.1), making the results
representative for the case of reasonable hyperparameter tuning efforts.
Random search was also chosen for the added benefit of providing enough
data for analyzing the general effect of various hyperparameters on the
performance of each LSTM variant (Section 5.2).

4.1. Datasets 

Each dataset is split into three parts: a training set, a validation set,
which is used for early stopping and for optimizing the hyperparameters,
and a test set for the final evaluation. Details of preprocessing for each
dataset are provided in the supplementary material.

4.1.1. TIMIT 

The TIMIT Speech corpus (Garofolo et al., 1993) is large enough to be a
reasonable acoustic modeling benchmark for speech recognition, yet it is
small enough to keep a large study such as ours manageable. Our
experiments focus on the frame-wise classification task for this dataset,
where the objective is to classify each audio-frame as one of 61 phones.
The performance is measured as classification error percentage. The
training, validation and test sets are split in line with Halberstadt
(1998) into 3696, 400 and 192 sequences, having 304 frames on average.

4.1.2. IAM Online

The IAM Online Handwriting Database (Liwicki & Bunke, 2005) consists of
English sentences as time series of pen movements that have to be mapped
to characters. The network uses four input features: the change in x and y
pen positions, the time since the current stroke started and a binary
value indicating whether the pen is lifted. The performance is measured in
terms of the Character Error Rate (CER) after decoding. The size of the
dataset was halved by subsampling, which makes the experiments run twice
as fast without harming the performance. The training, validation and test
sets contained 5355, 2956 and 3859 sequences, respectively, with an
average length of 334 frames.

4.1.3. JSB Chorales 

JSB Chorales (Allan & Williams, 2005) is a polyphonic music modeling
dataset. The preprocessed data consists of sequences of binary vectors and
the task is next-step prediction. The performance metric used is the
negative log-likelihood on the validation/test set. The complete dataset
consists of 229, 76, and 77 sequences (training, validation and test,
respectively) with an average length of 61.

4.2. Network Architectures & Training 

Bidirectional LSTM (Graves & Schmidhuber, 2005), which consists of two
hidden layers, one processing the input forwards and the other one
backwards in time, both connected to a single softmax output layer, was
used for the TIMIT and IAM Online tasks. A normal LSTM with one hidden
layer and a sigmoid output layer was used for the JSB Chorales task. As
the loss function we employed Cross-Entropy Error for TIMIT and JSB
Chorales, while for the IAM Online task the Connectionist Temporal
Classification (CTC) Error by Graves et al. (2006) was used. The initial
weights for all networks were drawn from a normal distribution with
standard deviation of 0.1. Training was done using Stochastic Gradient
Descent with Nesterov-style momentum (Sutskever et al., 2013) with updates
after each sequence. The learning rate was rescaled by a factor of
(1 − momentum). Gradients were computed using full BPTT for LSTMs (Graves
& Schmidhuber, 2005). Training stopped after 150 epochs or once there was
no improvement on the validation set for more than fifteen epochs.
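
The stopping rule and the learning-rate rescaling described above can be written down compactly. This is one reading of the text, not the study's code: the function names are hypothetical, and "no improvement" is interpreted here as "no strictly lower validation error".

```python
def effective_learning_rate(learning_rate, momentum):
    """Learning rate after rescaling by the factor (1 - momentum)."""
    return learning_rate * (1.0 - momentum)


def should_stop(val_errors, max_epochs=150, patience=15):
    """Decide whether training ends, given one validation error per epoch.

    Stops after `max_epochs` epochs, or once the best validation error
    lies more than `patience` epochs in the past.
    """
    epochs_done = len(val_errors)
    if epochs_done == 0:
        return False
    if epochs_done >= max_epochs:
        return True
    # min() returns the earliest epoch among ties, so a tie is not an improvement.
    best_epoch = min(range(epochs_done), key=lambda e: val_errors[e])
    return (epochs_done - 1 - best_epoch) > patience
```

For example, `should_stop([1.0] + [2.0] * 16)` is true because sixteen epochs have passed since the best validation error, whereas fifteen such epochs would not yet trigger the stop.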


4.3. LSTM Variants 

The vanilla LSTM from Section 2 is referred to as Vanilla (V). The eight
variants derived from the V architecture are the following:

1. No Input Gate (NIG)

2. No Forget Gate (NFG)

3. No Output Gate (NOG)

4. No Input Activation Function (NIAF)

5. No Output Activation Function (NOAF)

6. No Peepholes (NP)

7. Coupled Input and Forget Gate (CIFG)

8. Full Gate Recurrence (FGR)

The first six variants are self-explanatory. The CIFG variant uses only
one gate for gating both the input and the cell recurrent self-connection
(an LSTM modification proposed in GRU (Cho et al., 2014)). This is
equivalent to setting $f^t = 1 - i^t$ instead of learning the forget gate
weights independently. The FGR variant adds recurrent connections between
all the gates as in the original formulation of the LSTM (Hochreiter &
Schmidhuber, 1997). It adds nine additional recurrent weight matrices,
thus significantly increasing the number of parameters.
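
To make the effect of each variant on model size concrete, the sketch below counts the parameters of a single LSTM layer. It is an illustration under stated assumptions: the function name is hypothetical, only one unidirectional layer is counted (no output layer), peephole and bias vectors have one entry per block, and CIFG is assumed to drop all forget-gate parameters because the forget gate is computed from the input gate.

```python
def lstm_layer_param_count(n_blocks, n_inputs, variant="V"):
    """Number of trainable parameters in one LSTM layer for a given variant."""
    n, m = n_blocks, n_inputs
    per_gate = n * m + n * n + n   # input weights, recurrent weights, bias
    count = 4 * per_gate + 3 * n   # z, i, f, o plus three peephole vectors

    if variant in ("NIG", "NFG", "NOG", "CIFG"):
        count -= per_gate + n      # one gate and its peephole vector vanish
    elif variant == "NP":
        count -= 3 * n             # only the peephole vectors vanish
    elif variant == "FGR":
        count += 9 * n * n         # nine extra gate-to-gate recurrent matrices
    elif variant not in ("V", "NIAF", "NOAF"):
        raise ValueError(f"unknown variant: {variant}")
    return count
```

With 100 blocks and 10 inputs this gives 44,700 parameters for the vanilla layer and 134,700 for FGR, which illustrates why full gate recurrence is by far the most expensive variant.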

4.4. Hyperparameter Search 

While there are other methods to efficiently search for good
hyperparameters (cf. Snoek et al. 2012; Hutter et al. 2011), random search
has a couple of advantages for our setting: it is easy to implement,
trivial to parallelize and covers the search space more uniformly, thereby
improving the follow-up analysis of hyperparameter importance.

Each hyperparameter search consists of 200 trials (for a 
total of 5400 trials) of randomly sampling the following 
hyperparameters: 

• number of LSTM blocks per hidden layer: log-uniform 
samples from [20, 200]; 

• learning rate: log-uniform samples from [10⁻⁶, 10⁻²];

• momentum: 1 − log-uniform samples from [0.01, 1.0];

• standard deviation of Gaussian input noise: uniform 
samples from [0,1]. 
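
The sampling scheme in the list above can be sketched with the Python standard library. The names `log_uniform` and `sample_trial` are hypothetical, the rounding of the hidden size to an integer is an assumption, and the learning-rate bounds are passed in explicitly; the values in the usage line are illustrative.

```python
import math
import random


def log_uniform(rng, low, high):
    """Draw a sample whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_trial(rng, lr_low, lr_high):
    """Draw one hyperparameter setting for a random-search trial."""
    return {
        "hidden_size": round(log_uniform(rng, 20, 200)),
        "learning_rate": log_uniform(rng, lr_low, lr_high),
        # Momentum is sampled as 1 - log-uniform, concentrating mass near 1.
        "momentum": 1.0 - log_uniform(rng, 0.01, 1.0),
        "input_noise_std": rng.uniform(0.0, 1.0),
    }


# One search of 200 trials with a fixed seed for reproducibility.
rng = random.Random(0)
trials = [sample_trial(rng, 1e-6, 1e-2) for _ in range(200)]
```

Because each trial is drawn independently, the 200 settings can be evaluated in parallel, which is the property the text relies on.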

In the case of the TIMIT dataset, two additional (boolean) hyperparameters
were considered (not tuned for the other two datasets).

The first one was the choice between traditional momentum and
Nesterov-style momentum (Sutskever et al., 2013). Our analysis showed that
this had no measurable effect on performance, so the latter was
arbitrarily chosen for all further experiments. The second one was whether
to clip the gradients to the range [−1, 1]. This turned out to hurt
overall performance,² therefore the gradients were never clipped in the
case of the other two datasets.

Note that, unlike an earlier small scale study (Chung et al., 2014), the
number of parameters was not kept fixed for all variants. Since different
variants can utilize their parameters differently, fixing this number can
bias comparisons.

5. Results & Discussion 

Each of the 5400 experiments was run on one of 128 AMD Opteron CPUs at 2.5
GHz and took 24.3 h on average to complete. This sums up to a total
single-CPU computation time of just below 15 years. For TIMIT and JSB
Chorales, the test set performance of the best setup was 29.6%
classification error (CIFG) and a log-likelihood of -8.38 (NIG),
respectively. For the IAM Online dataset our best result was a Character
Error Rate of 9.26% (NP) on the test set. The best previously published
result is 11.5% CER by Graves et al. (2008) using a different and much
more extensive preprocessing.³

5.1. Comparison of the Variants 

The twenty best runs (according to validation set performance) for each
variant were compared with those for the baseline architecture (V).
Welch's t-test at a significance level of p = 0.05 was used⁴ to determine
whether the mean test set performance of each variant was significantly
different from that of the baseline. A summary of the results is shown in
Figure 2. The box for each variant for which the mean differs
significantly from the baseline is highlighted in blue. The mean number of
parameters used by those twenty best performing networks is also shown as
grey bar plots in the background.
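
The comparison described above rests on Welch's unequal-variance t statistic together with a Bonferroni-adjusted significance level. The sketch below computes both with the standard library; the function names are hypothetical, and turning the statistic into a p-value additionally requires the CDF of Student's t distribution, which is not part of the standard library and is therefore left out here.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two sample means.
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after the Bonferroni adjustment."""
    return alpha / n_tests
```

With eight variants tested against the baseline, `bonferroni_alpha(0.05, 8)` gives a per-test level of 0.00625.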

5.1.1. General Observations 

The first important observation based on Figure 2 is that removing the
output activation function (NOAF) or the forget gate (NFG) significantly
hurt performance on all three datasets. Apart from the CEC, the ability to
forget old information and the squashing of the cell state appear to be
critical for the LSTM architecture. Indeed, without the output activation
function, the block output can in principle grow unbounded. Coupling the
input and the forget gate avoids this problem and might render the use of
an output non-linearity less important, which could explain why GRU
performs well without it.

Input and forget gate coupling (CIFG) did not significantly change mean
performance on any of the datasets, although the best performance improved
slightly on music modeling. Similarly, removing peephole connections (NP)
also did not lead to significant changes, but the best performance
improved slightly for handwriting recognition. Both of these variants
simplify LSTMs and reduce the computational complexity, so it might be
worthwhile to incorporate these changes into the architecture.

Adding full gate recurrence (FGR) did not significantly change performance
on TIMIT or IAM Online, but led to worse results on the JSB Chorales
dataset. Given that this variant greatly increases the number of
parameters, we generally advise against using it. Note that this feature
was present in the original proposal of the LSTM (Hochreiter &
Schmidhuber, 1995; 1997), but has been absent in all following studies.

5.1.2. Task-specific observations

Removing the input gate (NIG), the output gate (NOG) and the input
activation function (NIAF) led to a significant reduction in performance
on speech and handwriting recognition. However, there was no significant
effect on music modeling performance. A small (but statistically
insignificant) average performance improvement was observed for the NIG
and NIAF architectures on music modeling. We hypothesize that these
behaviors will generalize to similar problems such as language modeling.
For supervised learning on continuous real-valued data (such as speech and
handwriting recognition), the input gate, output gate and input activation
function are all crucial for obtaining good performance.

² Although this may very well be the result of the range having been
chosen too tightly.

³ Note that these numbers differ from the best test-set performances that
can be found in Figure 2. This is the case because here we only report the
single best performing setup as determined on the validation set. In
Figure 2, on the other hand, we show the test-set performance of the 20
best setups for each variant.

⁴ We applied the Bonferroni adjustment to correct for performing eight
different tests (one for each variant).

Figure 2. Test set performance for top 10% (according to the validation
set) hyperparameter settings for each dataset and variant (shown in the
order V, CIFG, FGR, NP, NOG, NIAF, NIG, NFG, NOAF). Boxes show the range
between the 25th and the 75th percentile of the data, while the whiskers
indicate the whole range. The red dot represents the mean and the red line
the median of the data. The boxes of variants that differ significantly
from the vanilla LSTM are shown in blue with thick lines. The grey
histogram in the background presents the average number of parameters for
the top 10% performers of every variant. Figure best viewed in color.

5.2. Impact of Hyperparameters

The fANOVA framework for assessing hyperparameter importance by Hutter et
al. (2014) is based on the observation that marginalizing over dimensions
can be done efficiently in regression trees. This allows predicting the
marginal error for one hyperparameter while averaging over all the others.
Traditionally this would require a full hyperparameter grid search,
whereas here the hyperparameter space can be sampled at random.

Average performance for any slice of the hyperparameter space is obtained
by first training a regression tree and then summing over its predictions
along the corresponding subset of dimensions. To be precise, a random
regression forest of 100 trees is trained and their predictions are
averaged. This improves the generalization and allows for an estimation of
uncertainty of those predictions. The obtained marginals can then be used
to decompose the variance into additive components using the functional
ANalysis Of VAriance (fANOVA) method (Hooker, 2007), which provides an
insight into the overall importance of hyperparameters and their
interactions.

5.2.1. Analysis Of Variance

Figure 3 shows what fraction of the test set performance variance can be
attributed to different hyperparameters. It is obvious that the learning
rate is by far the most important hyperparameter, always accounting for
more than two thirds of the variance. The next most important
hyperparameter is the hidden layer size, followed by the input noise,
leaving the momentum with less than one percent of the variance. Higher
order interactions play an important role in the case of TIMIT, but are
much less important for the other two data sets. The hyperparameter
interplay is further discussed in Section 5.2.6.

Figure 3. Pie charts showing which fraction of variance of the test set
performance can be attributed to each of the hyperparameters. The
percentage of variance that is due to interactions between multiple
parameters is indicated as “higher order.”

5.2.2. Learning rate

The learning rate is the most important hyperparameter; it is therefore
important to understand how to set it correctly in order to achieve good
performance. Figure 4 shows (in blue) how setting the learning rate value
affects the predicted average performance on the test set. It is important
to note that this is an average over all other hyperparameters and over
all the trees in the regression forest. The shaded area around the curve
indicates the standard deviation over tree predictions (not over other
hyperparameters), thus quantifying the reliability of the average. The
same is shown in green with the predicted average training time.
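
The distinction between averaging over the other hyperparameters and measuring the spread over trees can be illustrated with a toy sketch. Here each "tree" is just a Python callable standing in for a fitted regression tree, and the marginal is approximated by averaging over a list of sampled settings; the actual fANOVA implementation computes marginals from the tree structure, so the names and the approach below are illustrative assumptions only.

```python
from statistics import mean, pstdev


def marginal_curve(trees, lr_values, other_settings):
    """Marginal prediction per learning rate: mean and spread over trees.

    trees: callables tree(lr, other) -> predicted error (stand-ins).
    other_settings: sampled values of all remaining hyperparameters.
    Returns a list of (lr, mean_over_trees, std_over_trees).
    """
    curve = []
    for lr in lr_values:
        # Each tree's marginal: average over the other hyperparameters.
        per_tree = [mean(tree(lr, other) for other in other_settings)
                    for tree in trees]
        # The spread is taken across trees, not across the other settings.
        curve.append((lr, mean(per_tree), pstdev(per_tree)))
    return curve


# Toy usage with two hand-written "trees" that disagree by a constant.
toy_trees = [lambda lr, other: lr + other,
             lambda lr, other: lr + other + 2.0]
toy_curve = marginal_curve(toy_trees, [1.0], [0.0, 2.0])
```

In this toy case the two trees predict marginals of 2.0 and 4.0 at a learning rate of 1.0, so the curve reports a mean of 3.0 with a standard deviation of 1.0 across trees.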

The plots in Figure 4 show that the optimal value for the learning rate is
dependent on the dataset. For each dataset, there is a large basin (up to
two orders of magnitude) of good learning rates inside of which the
performance does not vary much. A related but unsurprising observation is
that there is a sweet-spot for the learning rate at the high end of the
basin.⁵ In this region, the performance is good and the training time is
small. So while searching for a good learning rate for the LSTM, it is
sufficient to do a coarse search by starting with a high value (e.g. 1.0)
and dividing it by ten until performance stops increasing.
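
The coarse search suggested above can be sketched as a short loop. The function name, the `evaluate` callback (assumed to train a network and return its validation error, lower being better) and the cap on the number of steps are assumptions introduced for illustration; the toy error function merely mimics a basin around a learning rate of 0.001.

```python
import math


def coarse_learning_rate_search(evaluate, start=1.0, factor=10.0, max_steps=8):
    """Divide the learning rate by `factor` until the error stops improving.

    evaluate: callable mapping a learning rate to a validation error.
    Returns the best learning rate found and its error.
    """
    best_lr, best_err = start, evaluate(start)
    lr = start
    for _ in range(max_steps):
        lr /= factor
        err = evaluate(lr)
        if err >= best_err:
            break  # performance stopped improving; keep the previous value
        best_lr, best_err = lr, err
    return best_lr, best_err


# Toy stand-in for "train and validate": error is minimal at lr = 1e-3.
def toy_error(lr):
    return abs(math.log10(lr) + 3.0)


found_lr, found_err = coarse_learning_rate_search(toy_error)
```

Starting from the high end means the search stops at the largest learning rate in the basin, which is also where training time is smallest according to the observation above.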

Figure 3 also shows that the fraction of variance caused by the learning
rate is much bigger than the fraction due to interaction between learning
rate and hidden layer size (some part of the “higher order” piece, for
more see Section 5.2.6). This suggests that the learning rate can be
quickly tuned on a small network and then used to train a large one.

5.2.3. Hidden Layer Size 

Not surprisingly, the hidden layer size is an important hyperparameter
affecting the LSTM network performance. As expected, larger networks
perform better, and the required training time increases with the network
size.

5.2.4. Input Noise 

Additive Gaussian noise on the inputs, a traditional regularizer for
neural networks, has been used for LSTM as well. However, we find that not
only does it almost always hurt performance, it also slightly increases
training times. The only exception is TIMIT, where a small dip in error
for the range of [0.2, 0.5] is observed.

5.2.5. Momentum 

One unexpected result of this study is that momentum affects neither
performance nor training time in any significant way. This follows from
the observation that for none of the datasets, momentum accounted for more
than 1% of the variance of test set performance. It should be noted that
for TIMIT the interaction between learning rate and momentum accounts for
2.5% of the total variance, but as with learning rate × hidden size (cf.
Section 5.2.6) it does not reveal any interpretable structure. This may be
the result of our choice to scale learning rates dependent on momentum
(Section 4.2). These observations suggest that momentum does not offer
substantial benefits when training LSTMs with online stochastic gradient
descent. It may, however, be more important in the case of batch training,
where the gradients are less noisy.

⁵ Note that it is outside the plotted range for IAM Online and JSB
Chorales.

Figure 5. Left: The predicted marginal error for combinations of learning
rate and hidden size. Right: The component that is solely due to the
interaction of the two and cannot be attributed to changes in one of them
alone. In other words, it is the difference from the case in which the two
are perfectly independent. (Blue is better than red.)

5.2.6. Interaction of Hyperparameters 

Here we focus on the higher order interactions for the TIMIT dataset, for
which they were strongest, but our analysis revealed very similar behavior
for the other datasets:

learning rate × hidden size = 6.7%
learning rate × input noise = 4.4%
hidden size × input noise = 2.0%
learning rate × momentum = 1.5%
momentum × hidden size = 0.6%
momentum × input noise = 0.4%

The interaction between learning rate and the hidden size is the strongest
one, but Figure 5 does not reveal any systematic dependence between the
two. In fact it may be the case that more samples would be needed in order
to properly analyse the fine interplay between them, but given our
observations so far this might not be worth the effort. In any case, it is
clear that varying the hidden size does not change the region of optimal
learning rate.
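
What a pairwise interaction percentage means can be illustrated with a toy functional ANOVA decomposition on a complete two-dimensional grid. This is not the fANOVA implementation, which works on regression-forest predictions over randomly sampled settings; the function name, the uniform weighting of grid cells, and the restriction to two hyperparameters are simplifying assumptions.

```python
def variance_fractions(grid):
    """Split the variance of grid[a][b] into main effects and interaction.

    grid[a][b] is the performance at level a of hyperparameter A and
    level b of hyperparameter B. Returns the fractions of total variance
    attributed to (A alone, B alone, the A x B interaction).
    """
    na, nb = len(grid), len(grid[0])
    total_mean = sum(sum(row) for row in grid) / (na * nb)
    # Main effects: marginal means minus the overall mean.
    main_a = [sum(grid[a]) / nb - total_mean for a in range(na)]
    main_b = [sum(grid[a][b] for a in range(na)) / na - total_mean
              for b in range(nb)]
    var_a = sum(e * e for e in main_a) / na
    var_b = sum(e * e for e in main_b) / nb
    var_total = sum((grid[a][b] - total_mean) ** 2
                    for a in range(na) for b in range(nb)) / (na * nb)
    if var_total == 0.0:
        raise ValueError("performance is constant; nothing to decompose")
    # Whatever the main effects do not explain is due to the interaction.
    var_ab = var_total - var_a - var_b
    return var_a / var_total, var_b / var_total, var_ab / var_total
```

For a purely additive grid such as `[[0, 2], [1, 3], [2, 4]]` the interaction fraction is zero, which corresponds to the case of perfectly independent hyperparameters; a grid such as `[[1, 0], [0, 1]]` has no main effects at all and is entirely interaction.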

Figure 4. Predicted marginal error (blue) and marginal time (green) for
different values of the learning rate, hidden size, and the input noise
(columns) for all three datasets (rows). The shaded area indicates the
standard deviation between the tree-predicted marginals and thus the
reliability of the predicted mean performance. Note that each plot is for
the vanilla LSTM, but curves for all variants that are not significantly
worse look very similar.

6. Conclusion

This paper reports the results of a large scale study on variations of the
LSTM architecture. We conclude that:

• The most commonly used LSTM architecture (vanilla LSTM) performs
reasonably well on various datasets and using any of eight possible
modifications does not significantly improve the LSTM performance.

• Certain modifications such as coupling the input and forget gates or
removing peephole connections simplify LSTM without significantly hurting
performance.

• The forget gate and the output activation function are the critical
components of the LSTM block. While the first is crucial for LSTM
performance, the second is necessary whenever the cell state is unbounded.

• Learning rate and network size are the most crucial tunable LSTM
hyperparameters. Surprisingly, the use of momentum was found to be
unimportant (in our setting of online gradient descent). Gaussian noise on
the inputs was found to be moderately helpful for TIMIT, but harmful for
other datasets.

• The analysis of hyperparameter interactions revealed that even the
highest measured interaction (between learning rate and network size) is
quite small. This implies that the hyperparameters can be tuned
independently. In particular, the learning rate can be calibrated first
using a fairly small network, thus saving a lot of experimentation time.

Neural networks can be tricky to use for many practitioners compared to
other methods whose properties are already well understood. This has
remained a hurdle for newcomers to the field since a lot of practical
choices are based on the intuitions of experts, and experiences gained
over time. With this study, we have attempted to back some of these
intuitions with experimental results. We have also presented new insights,
both on architecture selection and hyperparameter tuning for LSTM
networks, which have emerged as the method of choice for solving complex
sequence learning problems. In future work, we will explore more complex
modifications of the LSTM architecture.
A. LSTM formulas

Here we repeat the vectorized formulas for a vanilla LSTM layer forward
pass from the paper, and then present the formulas for the backward pass.
We also provide formulas for all the studied variants.

A.1. Forward Pass

Here we reproduce the formulas of the forward pass from the paper, but we
split all gates and the block input into activity before ($\bar{\star}$)
and after the non-linearity ($\star$).

Let $N$ be the number of LSTM blocks and $M$ the number of inputs. Then we
get the following weights:

• Input weights: $W_z, W_i, W_f, W_o \in \mathbb{R}^{N \times M}$

• Recurrent weights: $R_z, R_i, R_f, R_o \in \mathbb{R}^{N \times N}$

• Peephole weights: $p_i, p_f, p_o \in \mathbb{R}^{N}$

• Bias weights: $b_z, b_i, b_f, b_o \in \mathbb{R}^{N}$

As in the paper, $x^t$ is the input vector at time $t$, and $\sigma$, $g$
and $h$ are pointwise non-linear functions, with
$\sigma(x) = \frac{1}{1+e^{-x}}$ being the logistic sigmoid. The pointwise
multiplication of two vectors is denoted with $\odot$.


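\begin{align*}
\bar{z}^t &= W_z x^t + R_z y^{t-1} + b_z \\
z^t &= g(\bar{z}^t) && \text{block input} \\
\bar{i}^t &= W_i x^t + R_i y^{t-1} + p_i \odot c^{t-1} + b_i \\
i^t &= \sigma(\bar{i}^t) && \text{input gate} \\
\bar{f}^t &= W_f x^t + R_f y^{t-1} + p_f \odot c^{t-1} + b_f \\
f^t &= \sigma(\bar{f}^t) && \text{forget gate} \\
c^t &= z^t \odot i^t + c^{t-1} \odot f^t && \text{cell} \\
\bar{o}^t &= W_o x^t + R_o y^{t-1} + p_o \odot c^t + b_o \\
o^t &= \sigma(\bar{o}^t) && \text{output gate} \\
y^t &= h(c^t) \odot o^t && \text{block output}
\end{align*}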
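As a rough illustration, one time step of the forward pass above could be written in NumPy as follows. This is only a sketch: the function name `lstm_step`, the dictionary layout of the weights, and the choice of tanh for both g and h are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied pointwise."""
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, y_prev, c_prev, W, R, p, b, g=np.tanh, h=np.tanh):
    """One forward step of a vanilla LSTM layer (hypothetical helper).

    W, R and b are dicts keyed by 'z', 'i', 'f', 'o'; p is keyed by 'i', 'f', 'o'.
    """
    z_bar = W["z"] @ x + R["z"] @ y_prev + b["z"]
    z = g(z_bar)                                      # block input
    i_bar = W["i"] @ x + R["i"] @ y_prev + p["i"] * c_prev + b["i"]
    i = sigmoid(i_bar)                                # input gate
    f_bar = W["f"] @ x + R["f"] @ y_prev + p["f"] * c_prev + b["f"]
    f = sigmoid(f_bar)                                # forget gate
    c = z * i + c_prev * f                            # cell state
    # The output gate peephole uses the new cell state c^t, not c^{t-1}.
    o_bar = W["o"] @ x + R["o"] @ y_prev + p["o"] * c + b["o"]
    o = sigmoid(o_bar)                                # output gate
    y = h(c) * o                                      # block output
    return y, c

# Usage: run a random sequence of 5 inputs through a layer with N=4, M=3.
rng = np.random.default_rng(0)
N, M = 4, 3
W = {k: rng.normal(size=(N, M)) for k in "zifo"}
R = {k: rng.normal(size=(N, N)) for k in "zifo"}
p = {k: rng.normal(size=N) for k in "ifo"}
b = {k: np.zeros(N) for k in "zifo"}
y, c = np.zeros(N), np.zeros(N)
for x in rng.normal(size=(5, M)):
    y, c = lstm_step(x, y, c, W, R, p, b)
```

Starting from zero state for $y^{-1}$ and $c^{-1}$ is a common convention and is assumed here.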


A.2. Backpropagation Through Time 

Here $\Delta^t$ is the vector of deltas passed down from the layer
above. If $E$ is the loss function it formally corresponds
to $\frac{\partial E}{\partial y^t}$, but not including the recurrent dependencies. The
deltas inside the LSTM block are then calculated as follows:


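\begin{align*}
\delta y^t &= \Delta^t + R_z^T \delta\bar{z}^{t+1} + R_i^T \delta\bar{i}^{t+1} + R_f^T \delta\bar{f}^{t+1} + R_o^T \delta\bar{o}^{t+1} \\
\delta\bar{o}^t &= \delta y^t \odot h(c^t) \odot \sigma'(\bar{o}^t) \\
\delta c^t &= \delta y^t \odot o^t \odot h'(c^t) + p_o \odot \delta\bar{o}^t + p_i \odot \delta\bar{i}^{t+1} \\
&\quad + p_f \odot \delta\bar{f}^{t+1} + \delta c^{t+1} \odot f^{t+1} \\
\delta\bar{f}^t &= \delta c^t \odot c^{t-1} \odot \sigma'(\bar{f}^t) \\
\delta\bar{i}^t &= \delta c^t \odot z^t \odot \sigma'(\bar{i}^t) \\
\delta\bar{z}^t &= \delta c^t \odot i^t \odot g'(\bar{z}^t)
\end{align*}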


The deltas for the inputs are only needed if there is a layer 
below that needs training: 


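\begin{align*}
\delta x^t = W_z^T \delta\bar{z}^t + W_i^T \delta\bar{i}^t + W_f^T \delta\bar{f}^t + W_o^T \delta\bar{o}^t
\end{align*}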


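Finally the gradients for the weights are calculated like this,
where again $\star$ can be any of $\{z, i, f, o\}$:

\begin{align*}
\delta W_\star &= \sum_{t=0}^{T} \langle \delta\bar{\star}^t, x^t \rangle
  & \delta p_i &= \sum_{t=0}^{T-1} c^t \odot \delta\bar{i}^{t+1} \\
\delta R_\star &= \sum_{t=0}^{T-1} \langle \delta\bar{\star}^{t+1}, y^t \rangle
  & \delta p_f &= \sum_{t=0}^{T-1} c^t \odot \delta\bar{f}^{t+1} \\
\delta b_\star &= \sum_{t=0}^{T} \delta\bar{\star}^t
  & \delta p_o &= \sum_{t=0}^{T} c^t \odot \delta\bar{o}^t
\end{align*}

Here $\langle \star_1, \star_2 \rangle$ denotes the outer product of two vectors.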
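The backward formulas can be checked numerically. The following NumPy sketch implements the deltas and weight gradients for g = h = tanh and compares one gradient entry against a finite difference. The helper names (`init_params`, `forward`, `backward`, `loss`), the toy linear loss, and the zero initial state are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(N, M, rng):
    """Hypothetical helper: random weights with the shapes from A.1."""
    p = {}
    for k in "zifo":
        p["W" + k] = rng.normal(0, 0.5, (N, M))
        p["R" + k] = rng.normal(0, 0.5, (N, N))
        p["b" + k] = rng.normal(0, 0.5, N)
    for k in "ifo":
        p["p" + k] = rng.normal(0, 0.5, N)
    return p

def forward(p, xs):
    """Forward pass with g = h = tanh; returns one cache dict per time step."""
    N = p["bz"].size
    y, c = np.zeros(N), np.zeros(N)
    cache = []
    for x in xs:
        z = np.tanh(p["Wz"] @ x + p["Rz"] @ y + p["bz"])
        i = sigmoid(p["Wi"] @ x + p["Ri"] @ y + p["pi"] * c + p["bi"])
        f = sigmoid(p["Wf"] @ x + p["Rf"] @ y + p["pf"] * c + p["bf"])
        c_new = z * i + c * f
        o = sigmoid(p["Wo"] @ x + p["Ro"] @ y + p["po"] * c_new + p["bo"])
        y_new = np.tanh(c_new) * o
        cache.append(dict(x=x, y_prev=y, c_prev=c, z=z, i=i, f=f,
                          c=c_new, o=o, y=y_new))
        y, c = y_new, c_new
    return cache

def backward(p, cache, Deltas):
    """Backpropagation through time; Deltas[t] is dE/dy^t from the layer above."""
    N = p["bz"].size
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    nxt = {k: np.zeros(N) for k in "zifo"}      # deltas of step t+1 (zero past the end)
    dc_next, f_next = np.zeros(N), np.zeros(N)
    for t in reversed(range(len(cache))):
        s = cache[t]
        hc = np.tanh(s["c"])
        dy = Deltas[t] + sum(p["R" + k].T @ nxt[k] for k in "zifo")
        d = {"o": dy * hc * s["o"] * (1 - s["o"])}
        dc = (dy * s["o"] * (1 - hc ** 2) + p["po"] * d["o"]
              + p["pi"] * nxt["i"] + p["pf"] * nxt["f"] + dc_next * f_next)
        d["f"] = dc * s["c_prev"] * s["f"] * (1 - s["f"])
        d["i"] = dc * s["z"] * s["i"] * (1 - s["i"])
        d["z"] = dc * s["i"] * (1 - s["z"] ** 2)
        for k in "zifo":
            grads["W" + k] += np.outer(d[k], s["x"])
            grads["R" + k] += np.outer(d[k], s["y_prev"])   # y^{t-1} is zero at t = 0
            grads["b" + k] += d[k]
        grads["pi"] += s["c_prev"] * d["i"]                 # c^{t-1} is zero at t = 0
        grads["pf"] += s["c_prev"] * d["f"]
        grads["po"] += s["c"] * d["o"]
        nxt, dc_next, f_next = d, dc, s["f"]
    return grads

def loss(p, xs, ws):
    """Toy loss E = sum_t w^t . y^t, so that Delta^t = w^t."""
    return sum(w @ s["y"] for w, s in zip(ws, forward(p, xs)))

rng = np.random.default_rng(0)
N, M, T = 3, 2, 6
p = init_params(N, M, rng)
xs, ws = rng.normal(size=(T, M)), rng.normal(size=(T, N))
grads = backward(p, forward(p, xs), ws)

# Central finite difference on one recurrent weight as a sanity check.
eps = 1e-6
p["Ri"][0, 1] += eps
e_plus = loss(p, xs, ws)
p["Ri"][0, 1] -= 2 * eps
e_minus = loss(p, xs, ws)
p["Ri"][0, 1] += eps
numeric = (e_plus - e_minus) / (2 * eps)
print(numeric, grads["Ri"][0, 1])
```

Accumulating `np.outer(d[k], y_prev)` at every step is equivalent to the shifted sum for $\delta R_\star$ above, because the state before the first step is taken to be zero.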
A.3. Variants

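We only report differences to the formulas from Section A.1:

1. No Input Gate (NIG): $i^t = 1$

2. No Forget Gate (NFG): $f^t = 1$

3. No Output Gate (NOG): $o^t = 1$

4. No Input Activation Function (NIAF): $g(x) = x$

5. No Output Activation Function (NOAF): $h(x) = x$

6. Coupled Input and Forget Gate (CIFG):

\begin{align*}
f^t = 1 - i^t
\end{align*}

7. No Peepholes (NP):

\begin{align*}
\bar{i}^t &= W_i x^t + R_i y^{t-1} + b_i \\
\bar{f}^t &= W_f x^t + R_f y^{t-1} + b_f \\
\bar{o}^t &= W_o x^t + R_o y^{t-1} + b_o
\end{align*}

8. Full Gate Recurrence (FGR):

\begin{align*}
\bar{i}^t &= W_i x^t + R_i y^{t-1} + p_i \odot c^{t-1} + b_i \\
&\quad + R_{ii} i^{t-1} + R_{fi} f^{t-1} + R_{oi} o^{t-1} \\
\bar{f}^t &= W_f x^t + R_f y^{t-1} + p_f \odot c^{t-1} + b_f \\
&\quad + R_{if} i^{t-1} + R_{ff} f^{t-1} + R_{of} o^{t-1} \\
\bar{o}^t &= W_o x^t + R_o y^{t-1} + p_o \odot c^t + b_o \\
&\quad + R_{io} i^{t-1} + R_{fo} f^{t-1} + R_{oo} o^{t-1}
\end{align*}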
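The variants above differ in how many trainable parameters a layer has. A small sketch of the resulting counts, derived only from the weight shapes in Section A.1, is given below. The function `param_count` is a hypothetical helper, and it assumes that parameters which no longer influence the output (for example the weights of a removed gate) are dropped.

```python
def param_count(N, M, variant="V"):
    """Number of trainable parameters of one LSTM layer for a given variant.

    N: number of LSTM blocks, M: number of inputs, "V": vanilla LSTM.
    Assumption: parameters that no longer affect the output are removed.
    """
    per_unit = N * M + N * N + N        # W, R and b of the block input or of one gate
    vanilla = 4 * per_unit + 3 * N      # block input + 3 gates, plus 3 peephole vectors
    gate = per_unit + N                 # one gate including its peephole vector
    if variant in ("NIG", "NFG", "NOG", "CIFG"):
        return vanilla - gate           # one gate is fixed to 1 or tied to another gate
    if variant == "NP":
        return vanilla - 3 * N          # peephole vectors dropped
    if variant == "FGR":
        return vanilla + 9 * N * N      # nine extra gate-to-gate recurrent matrices
    if variant in ("V", "NIAF", "NOAF"):
        return vanilla                  # changing g or h adds no parameters
    raise ValueError(f"unknown variant: {variant}")

# Usage: compare the variants for a layer with 100 blocks and 39 inputs.
for name in ("V", "NIG", "CIFG", "NP", "FGR"):
    print(name, param_count(100, 39, name))
```

For a fixed number of blocks, FGR is the only variant that grows the layer, which is worth keeping in mind when comparing variants at equal parameter budgets.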


B. Datasets 

This section provides details on the datasets and their pre¬ 
processing that were used in the LSTM comparison tasks. 

B.1. TIMIT

We use the TIMIT Speech corpus (Garofolo et al., 1993) for
framewise phone classification. The full set of 61 phones
was used as targets. From the raw audio we extract 12 Mel
Frequency Cepstrum Coefficients (MFCCs) (Mermelstein,
1976) + energy over 25 ms Hamming windows with a stride
of 10 ms and a pre-emphasis coefficient of 0.97. These 13
inputs, along with their first and second derivatives, comprise
the 39 inputs to the network and are normalized to
have mean 0 and variance 1.
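The step from 13 to 39 inputs can be sketched as follows. The text does not say how the derivatives were computed or over which data the normalization statistics were taken, so the central differences via `np.gradient` and the per-dimension statistics used here are assumptions, as is the function name `add_deltas_and_normalize`.

```python
import numpy as np

def add_deltas_and_normalize(feats, mean=None, std=None):
    """feats: (T, 13) array holding 12 MFCCs + energy for each frame.

    Returns a (T, 39) array with the static features and their first and
    second time derivatives, standardized per dimension.
    """
    d1 = np.gradient(feats, axis=0)     # first derivative along time (assumed method)
    d2 = np.gradient(d1, axis=0)        # second derivative along time
    full = np.concatenate([feats, d1, d2], axis=1)
    if mean is None:                    # fall back to statistics of this array
        mean, std = full.mean(axis=0), full.std(axis=0)
    return (full - mean) / std

# Usage with random stand-in features for 50 frames.
feats = np.random.default_rng(0).normal(size=(50, 13))
inputs = add_deltas_and_normalize(feats)
print(inputs.shape)
```

In practice one would probably estimate `mean` and `std` on the training set and pass them in for validation and test data.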

We restrict our study to an established subset of the full
TIMIT corpus as detailed by Halberstadt (1998). In short,
that means we only use the core test set and drop the SA
samples from the training set. For validation we use some
of the discarded samples from the full test set.




[Figure 6, panel (a): image of a handwritten board, not reproduced here.]

(a)

Ben Zoma said: "The days of Ithy
life means in the day-time; all the days
of Ithy life means even at night-time ."
(Berochoth .) And the Rabbis thought
it important that when we read the

(b)

Figure 6. (a) Example board (a08-551z, training set) from the
IAM-OnDB dataset and (b) its transcription into character label
sequences.


B.2. IAM Online

The IAM On-Line Handwriting Database (IAM-OnDB; Liwicki
& Bunke, 2005)^ was used for the handwriting experiments
in the IAM Online task. The IAM-OnDB dataset
is split into one training set, two validation sets, and one test set,
containing 775, 192, 216, and 544 boards, respectively. Each board, see
Figure 6(a), contains multiple handwritten lines. Each line is split
into strokes, represented by sequences of 3-dimensional
vectors of x, y (pen position) and t (time) coordinates.
The beginnings and ends of the characters within each stroke are
not explicitly marked. The strokes were joined together
and a fourth dimension was added that contains a value of 1 at the time
of pen lifting (a transition to the next stroke) and zeros
at all other time steps. Each handwriting line is accompanied
by a target character sequence, see Figure 6(b), assembled
from the following 81 ASCII characters:

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789 !"#&\'()*+,-./[]:;?

The board labeled a08-551z (in the training set) contains
a sequence of 11 percent (%) characters that has no
counterpart in the strokes, and the percent character does
not occur in any other board. This board was therefore removed from
the experiments.
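The joining of strokes with a pen-lift indicator described above could look roughly like this. It is a sketch under assumptions: the function name `join_strokes` is hypothetical, each stroke is assumed to be non-empty, and the flag is placed on the last point of every stroke, including the final one.

```python
import numpy as np

def join_strokes(strokes):
    """strokes: list of non-empty (T_k, 3) arrays with columns x, y, t.

    Returns a (sum of T_k, 4) array. The fourth column is 1 at the last
    point of each stroke (pen lifting) and 0 at all other time steps.
    """
    rows = []
    for s in strokes:
        s = np.asarray(s, dtype=float)
        pen_up = np.zeros((len(s), 1))
        pen_up[-1, 0] = 1.0             # transition to the next stroke
        rows.append(np.hstack([s, pen_up]))
    return np.vstack(rows)

# Usage: a line consisting of two short strokes.
line = join_strokes([
    [[0.0, 0.0, 0.00], [1.0, 0.5, 0.01], [2.0, 1.0, 0.02]],
    [[5.0, 0.0, 0.10], [6.0, 0.2, 0.11]],
])
print(line[:, 3])
```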

^The IAM-OnDB was obtained from
http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database
Figure 7. Test set performance for all 200 runs for each dataset and variant. Boxes show the range between the 25th and the 75th
percentile of the data, while the whiskers indicate the whole range. The red dot represents the mean and the red line the median of the
data. The boxes of variants that differ significantly from the vanilla LSTM are shown in blue with thick lines. The grey histogram in the
background presents the average number of parameters for every variant.


The two validation sets were joined together. The training,
validation, and test sets contain 5 355, 2 956 and 3 859
lines, respectively. The sequences were subsampled to half their length
(they still contain enough information, and this speeds up
training). Instead of absolute pen positions, their differences
were used. The data was standardized. No additional preprocessing
(such as baseline straightening or cursive correction)
was used. The CTC error function by Graves et al.
(2006) was used for labeling the 81 characters, and best-path
decoding was used for determining the Character Error
Rate.
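Best-path decoding and the Character Error Rate can be illustrated with a short sketch. Best-path decoding takes the most likely label per frame, collapses repeats, and removes blanks. The index of the blank label and the normalization of the error rate (total edit distance divided by total reference length) are assumptions here, as are all function names.

```python
import numpy as np

def best_path_decode(probs, blank=0):
    """probs: (T, K) per-frame label probabilities from a CTC output layer."""
    out, prev = [], blank
    for k in probs.argmax(axis=1):
        if k != prev and k != blank:    # collapse repeats, then drop blanks
            out.append(int(k))
        prev = k
    return out

def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def character_error_rate(hyps, refs):
    """Total edit distance divided by total reference length (assumed definition)."""
    errors = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    return errors / sum(len(r) for r in refs)

# Usage: one-hot "probabilities" for the frame-wise path 0 1 1 0 1 2 2 0.
path = [0, 1, 1, 0, 1, 2, 2, 0]
decoded = best_path_decode(np.eye(3)[path])
print(decoded, character_error_rate([decoded], [[1, 2]]))
```

The blank between the two runs of label 1 is what allows the decoded sequence to contain the same character twice in a row.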

B.3. JSB Chorales

JSB Chorales is a collection of 382 four-part harmonized
chorales by J. S. Bach (Allan & Williams, 2005), consisting
of 202 chorales in major keys and 180 chorales in
minor keys. We used the preprocessed piano-rolls provided
by Boulanger-Lewandowski et al. (2012), currently
available at http://www-etud.iro.umontreal.ca/~boulanni/icml2012.
These piano-rolls were
generated by transposing each MIDI sequence to C major
or C minor and sampling frames every quarter note.

C. Additional plots 

Here we present some additional plots that didn’t make it 
into the paper. 

C.1. Full Boxplot for all Variants

In Figure 7 a box-whiskers-plot of the performance over all 
200 runs (in contrast to only the top 20 as in the paper) is 
shown for every variant. 


C.2. Performance and Time Scatterplots 

Scatterplots of training time vs. performance for all runs
can be seen in Figures 9, 10, and 11. The individual variants
are shown with different markers. We were hoping to identify
some clusters along the Pareto front of that tradeoff,
but no such structure could be found.

C.3. Hyperparameter Interactions 

In Figure 8 we visualize the interaction between all pairs
of hyperparameters. The figure is divided vertically into three subplots,
one for every dataset (TIMIT, IAM Online, and JSB
Chorales). Each subplot is divided horizontally into
two parts, each containing a lower triangular matrix of
heatmaps. The rows and columns of these matrices represent
the different hyperparameters (learning rate, momentum,
hidden size, and input noise) and there is one
heatmap for every combination. The color encodes the
performance as measured by the Classification Error for
TIMIT, Character Error Rate for IAM Online and Negative
Log-Likelihood for the JSB Chorales Dataset. For all
datasets low (blue) is better than high (red).

Each heatmap in the left part shows marginal performance
for different values of the respective two hyperparameters.
This is the average performance predicted by the decision
forest when marginalizing over all other hyperparameters.
So each one is the 2D version of the performance plots from
Figure 4 in the paper.

The right side employs the idea of ANOVA to better illustrate
the interaction between the hyperparameters. Here we
removed the variance of performance that can be explained
by varying a single hyperparameter and plot only what is
left. If two hyperparameters do not interact
at all (are perfectly independent), that residual would be all
zero (grey).
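The idea behind the right-hand plots can be sketched on a regular grid of marginal predictions: subtract the overall mean and the two single-hyperparameter effects, and what remains is the interaction. This is a simplified illustration with a hypothetical function name; the actual fANOVA computation works on the decision forest rather than on a uniform grid.

```python
import numpy as np

def interaction_residual(grid):
    """grid[a, b]: marginal predicted performance for value a of one
    hyperparameter and value b of another.

    Returns the part of grid not explained by the overall mean or by
    either hyperparameter alone.
    """
    total = grid.mean()
    main_a = grid.mean(axis=1, keepdims=True) - total   # effect of the first one
    main_b = grid.mean(axis=0, keepdims=True) - total   # effect of the second one
    return grid - total - main_a - main_b

# Usage: a purely additive surface has no interaction.
a = np.array([3.0, 1.0, 2.0])
b = np.array([10.0, 20.0])
additive = a[:, None] + b[None, :]
print(interaction_residual(additive))                   # all (numerically) zero

# A product term does interact.
print(interaction_residual(np.outer([0.0, 1.0], [0.0, 1.0])))
```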

So look for example at the pair hidden size and learning 
rate on the left side for the TIMIT dataset. We can see 
that in general performance varies strongly along the x-axis 
(learning rate), first decreasing and then increasing again. 
This is what we would expect knowing the valley-shape of 
the learning rate from Figure 4 in the paper. Along the y- 
axis (hidden size) performance seems to decrease slightly 
from top to bottom. Again this is roughly what we would 
expect. 

Now let's look at the same pair on the right side. This plot
shows how the heatmap on the left differs from the case of
the two hyperparameters being independent. Here a blue
pixel means that the marginal error for this combination
of learning rate and hidden size is lower (better) than you
would expect. Note that the scale is much smaller for
the right side (-3 to 3 as opposed to 32 to 60), and still many
of the heatmaps are close to grey.
Figure 8. Total marginal predicted performance for all pairs of hyperparameters (left) and the variation only due to their interaction
(right).

Figure 9. Scatterplots for all 1800 experiments of the TIMIT Dataset. We show performance on the x-axis vs training time on the y-axis
(logarithmically). The variants are displayed with different colors or markers.

Figure 10. Scatterplots for all 1800 experiments of the IAM Online Dataset. We show performance on the x-axis vs training time on the
y-axis (logarithmically). The variants are displayed with different colors or markers.

Figure 11. Scatterplots for all 1800 experiments of the JSB Chorales Dataset. We show performance on the x-axis vs training time on
the y-axis (logarithmically). The variants are displayed with different colors or markers.









